Matin takhtesangi aspects of validity
Aspects of test validity

1. Writing Task

The scenario involved assessing the basic writing skills of 90 teenage students who were required to discuss their experience of a time they were travelling somewhere on a very long journey. Based upon a preliminary examination, there are some positive aspects to the task. For example, it draws on the personal experiences of the candidates, it should aid in determining their English proficiency in writing, and it will assist future teaching and learning by isolating areas of difficulty, particularly in the case of the second marking approach.

Nevertheless, much is unknown about the circumstances surrounding the task that could influence its effectiveness and complicate its analysis in the present paper. These include
· the purpose of the task, such as an achievement test or a placement test,
· the nature of the students, such as their life experiences,
· the syllabus on which it was based,
· the stage of the course at which it was administered,
· the expected length (in words) of the response,
· the time allocated for task completion,
· the instructions (if any) provided, and
· whether all students were tested simultaneously.

(a) Validity

Hughes (1990, p. 9) describes “validity” as the ability of a test to provide “consistent measures of precisely the abilities we are interested in.” Various types of validity are discussed in the literature, and those relevant to classroom assessment are examined below.

The task appears to have construct validity since it is a direct test where students perform the skill we wish to measure, and since writing is considered a distinct ability (Hughes 1990, p. 31). It may also have face validity, as it seems to measure what it is supposed to (Hughes 1990, p. 33).
However, this is somewhat clouded by the lack of details concerning task difficulty relative to the level of student proficiency (such as an upper-intermediate task for elementary students), and the appropriateness of the timing allowed for its completion (Brown 2004, p. 34). In addition, it is not known whether the candidates would have perceived the test as having face validity, since it is not stated whether this was a familiar task, whether the directions were clear, whether the time allowed was fair, whether the level was appropriate, or whether it related to their course (Brown 2004, p. 27).

Since a specification of the skills and structures within the course of study is not provided, it is not known whether the task has predictive validity (about future performance) or content validity (that is, if it is a representative sample of the skills meant to be covered) (Hughes 1990, p. 26).

A further concern is that the writing sample elicited by the task covers only one of two writing skills, namely the ability to describe an event or to narrate a sequence of events. As such, it is unlikely to provide sufficient data on the basic writing skills of students, which was its stated objective. Compared to other test specifications, such as the Cambridge Certificate in Writing Level 1 (Hughes 1990, p. 84), it will only elicit a small sample of the tasks students are usually required to perform in general English courses. A broader sampling of course content could be obtained by requiring shorter answers to a relatively large number of questions (Jacobs 2009).

Two possible threats to validity may be invalid application of the test (if it was unsuitable for this population) and inappropriate selection of content (such as bias towards only part of the curriculum) (Henning 1987, p. 91). However, these cannot be confirmed without knowing the purpose of the syllabus and the course objectives, as well as what was actually covered in lessons.
In terms of the actual candidates, it is not known whether all of them would have had the experience of travelling on a long journey. In some cultures and many less affluent families, such opportunities may not be available to young people, so they would need to rely more on their imagination than on recall of actual events. Depending on the focus of the marking scheme, this could disadvantage these candidates. Further, the variance in their ages (whether homogeneous or distributed) and the length of their language training are unknown. These personal and cultural characteristics may negatively affect task validity.

(b) Reliability

Here one is interested in identifying and minimising the extent to which test performance is due to errors of measurement or factors other than the language ability that we wish to measure. Examples of such factors include the test method, personal learner attributes, and random factors (Bachman 1990, pp. 160-161).

Although one cannot be sure of the required degree of reliability for this task since its purpose is not stated (more important decisions demand greater reliability) (Hughes 1990, p. 39), there are some general observations which can be made. One positive aspect is that the freedom of candidates has been limited to responding to only one question. According to Hughes (1990, p. 45), this limits differences in performance between candidates on each occasion.

Since appropriate details are not given, it is not possible to determine the effects of fluctuations in test administration (such as clarity of instructions and sufficiency of time allocation) or test characteristics (such as expected length of response and difficulty relative to candidate level) (Henning 1987, p. 91) on task reliability.

(c) Authenticity

For the present discussion, authenticity may be defined as “the extent to which test tasks replicate real-life language use tasks” (Bachman 1990, p. 307).
Using this definition, the writing task may be considered authentic to the extent that candidates wrote about their own experiences using a meaningful topic. Additionally, depending on student circumstances, it may be authentic if success here is a major course goal. Hughes (1990, p. 17) cautions, however, that tests generally cannot be truly authentic since candidates are aware that they are in a test situation, rather than a real-world one. One way of increasing authenticity might be to require students to respond by writing in a diary or journal or by sending an email, both of which are more communicative (and real-life language tasks) than essays.

A useful approach to determining the authenticity of an assessment task is provided by Mueller (2009). He considers it important that students demonstrate their understanding by performing complex tasks; demonstrate proficiency by doing something; analyse, synthesise and apply their learning in a substantial way while creating new meaning; and show direct evidence of the application and construction of knowledge. At the same time, they should have some choice in what is presented as evidence. The writing task being discussed here seems to meet Mueller's specifications.

(d) Backwash

Backwash, defined simply as “the effect that tests have on learning and teaching” (Hughes 1990, p. 53), has been the subject of much debate concerning its existence and previously suffered from limited research evidence (Spratt 2005, p. 8). More recently, Rea-Dickins and Scott (2007, p. 6) attempted to resolve this by concluding that after considering all the research to date, “washback” does exist as a phenomenon. Unfortunately, their definition is not so reassuring. According to them, it is “a context-specific shifting process, unstable, involving changing behaviours in ways which are difficult to predict.” Further, as noted by Alderson and Wall (1993, p.
115), one cannot simply assume that tests will affect teaching and learning at all, but must investigate each situation.

Spratt (2005, p. 29) lists the following aspects of the test itself as backwash factors:
· its proximity
· its stakes
· the status of the language it tests
· its purpose
· the formats it employs
· the weighting of individual papers
· when the exam was introduced
· how familiar it is to teachers

In relation to the present task, all of these factors (other than the format) are unknown. Nevertheless, likely beneficial backwash here may come from the use of a direct writing test. In addition, if such writing (based on personal experiences) is a significant aspect of passing the course, it should be beneficial. However, if similar tasks appear regularly in tests, the task may have a negative backwash impact, since it will restrict the extent of test preparation across the full range of writing tasks (Hughes 1990, p. 54).

2. Marking Approach 1

Of the two given approaches, this one presents more problems in relation to validity and likely backwash. Concerning reliability, it is not clear whether this approach employs blind marking (such as using student numbers rather than names). If teachers are marking their own students' work without any cross comparisons with colleagues, this may affect fairness and objectivity. Further, since few instructions are provided to markers, there are likely to be differences between them and greater subjectivity in marking. Thirdly, as no time has been set aside in the daily timetable for marking, this may encourage teachers to cut corners by only superficially assessing each paper and minimising feedback. As noted by Henning (1987, p. 76), such intra-rater variance is likely to lead to fluctuations in scoring, reduce test reliability and lead to negative backwash.

Even greater difficulties arise with the five aspects of the composition listed for assessment.
Firstly, there is no definition of each term, so it is unclear what elements are being looked for in each case. Secondly, the standard of competence in each aspect is unspecified. Thirdly, at least one criterion (“argument”) appears to be irrelevant to the task, which is a descriptive or narrative essay. Fourthly, there is no indication of the relative weighting of each aspect, or whether each is equally weighted. This may result in different weightings between markers. Fifthly, there is no guidance on how to adjust the score for mistakes made. If there is a deduction for each error, this will act as a disincentive to writing longer responses, which potentially contain more errors. Finally, it is not known whether these are the same aspects specified in the course curriculum that students have been studying. All of these matters are likely to increase inter-rater variance (Henning 1987, p. 76) and reduce test reliability.

A final concern with the first marking approach is the limited range of aspects being analysed by raters. By comparison, the Cambridge “O-Level” Examination (2009) looks at nine aspects of writing (accuracy, sentences, verb forms, vocabulary, punctuation, spelling, paragraphs, topic, and tone and register). As Brown and Bailey (1984) state, rubrics having a greater number of sub-scales tend to result in greater overall consistency of scoring.

3. Marking Approach 2

A number of objections raised about the first approach are resolved here, making it a more acceptable approach. In particular, each script is marked twice for greater inter-rater consistency, and any anomalies are resolved mutually. Secondly, the range of sub-skills assessed is expanded to seven (compared to five in the first approach). In addition, there is a detailed three-level marking scheme for each criterion based on published standards (O'Neill & Gish 2008, p. 247).
Since this looks at broader trends in writing scripts, it is less likely than the first approach to dissuade candidates from submitting longer responses. Nevertheless, a number of problems remain, including a lack of marking time and a lack of details on the relative weighting of each criterion and on the link between sub-skills and the curriculum.

Concerning both approaches, there may be negative backwash if students are only provided with a numerical score (Brown 2004, p. 29) rather than detailed feedback, since this gives no guidance on areas requiring effort to improve or reassurance about the fairness of assessment. However, the first approach is likely to have a more severe negative impact due to its greater subjectivity. Students may become discouraged if they believe that the marking method is unfair, and that their score will depend on the whims of a particular rater.

4. Recommendations

Following this discussion, there are a number of ways in which task validity, reliability, authenticity and backwash may be enhanced, or at least determined more accurately. In particular:
· by provision of details about student language level and personal characteristics, the length of response expected, the time allowed, the directions to candidates, and the course specification (validity);
· by supply of test administration procedures, use of blind or cross marking procedures, a detailed marking guide based on a wide range of sub-skills and specifying the weighting of each, and an allowance of dedicated time to complete marking (reliability);
· by confirming that candidates have had the experience assumed by the task, and that success at the task is a major course goal (authenticity); and
· by specifying the purpose of the test, the status of English, teacher familiarity with the task, and giving detailed feedback to candidates (backwash).

The following writing task was used to assess the basic writing skills of 90 teenage students.
Think of a time when you were travelling somewhere and the journey was very long. Discuss your experience.

Marking approach 1

Scripts were divided equally between all teachers working with these students to mark in their non-teaching time. They were asked to provide a score for each of the following language features, then allocate a total score out of 25 and convert that score to a percentage.
(a) spelling
(b) punctuation
(c) grammar
(d) control
(e) argument

Marking approach 2

All scripts were marked by two teachers of English. They considered the following language features: spelling, punctuation, grammar, vocabulary, cohesion, sentence structure and also overall impression. After quickly reading each script, they grouped the scripts on the basis of their overall impression. To mark the various language features they applied a range of criteria which had previously been developed to link with identified levels of proficiency. Thus, multiple scoring was in use and markers were provided with descriptive criteria to guide the allocation of scores. For example, when marking for cohesion they used criteria such as those noted in O'Neill and Gish (2008):
· Score 2 when there is use of complex sentences, lack of repetitious use of ‘and’ and ‘then’, use of more sophisticated cohesive ties such as however, although, in fact, first, secondly, usually, after, before, as soon as, until, while, eventually, during, meanwhile, thus, consequently and therefore, and there is evidence of use of some variety of cohesive ties and the piece of writing conveys a sense of completeness.
· Score 1 when there is either use of predominantly simple sentences or simple sentences and some complex sentences which provide evidence of connecting ideas through the use of basic ties such as and, then, so, but, also, next, when, because and suddenly.
· Score 0 when there is little or no evidence of use of cohesive ties and/or connecting of ideas, lack of sense of wholeness, illegible responses or one which is irrelevant to the set topic.

When all scripts were marked, they compared their scoring and, if there were differences, they reviewed the script in relation to the marking criteria and arrived at a mutually agreed upon score.
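
The double-marking procedure described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the source gives a 0 to 2 scale for cohesion alone, so applying that scale to all seven features, weighting them equally, and the function names `find_disagreements` and `percentage` are all hypothetical choices made here, not part of the original marking scheme.

```python
# Features considered in marking approach 2 (taken from the task description).
FEATURES = (
    "spelling", "punctuation", "grammar", "vocabulary",
    "cohesion", "sentence structure", "overall impression",
)

# ASSUMPTION: every feature is scored 0, 1 or 2 and weighted equally.
# The source specifies this scale only for cohesion and gives no weighting.
MAX_PER_FEATURE = 2


def find_disagreements(marker_a, marker_b):
    """Return the features on which the two markers gave different scores."""
    return [f for f in FEATURES if marker_a[f] != marker_b[f]]


def percentage(scores):
    """Convert one set of feature scores into a percentage of the maximum."""
    total = sum(scores[f] for f in FEATURES)
    return 100 * total / (MAX_PER_FEATURE * len(FEATURES))


# Two markers score the same script; they differ only on cohesion.
marker_a = {f: 1 for f in FEATURES}
marker_a["cohesion"] = 2
marker_b = {f: 1 for f in FEATURES}

# Features listed here would be reviewed against the criteria by both markers.
to_review = find_disagreements(marker_a, marker_b)
print(to_review)

# Suppose the markers agree on marker A's cohesion score after review.
agreed = dict(marker_a)
print(round(percentage(agreed), 1))
```

Listing the disputed features, rather than averaging the two scores, mirrors the requirement that markers return to the criteria and reach a mutually agreed score. A real scheme would also need the weighting of each criterion, which the scenario does not supply.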